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NEAREST  NEIGHBOR  CLASSIFICATION  OF  STATIONARY  TIME  SERIES: 
AN  APPLICATION  TO  ANESTHESIA  LEVEL 
CLASSIFICATION  BY  EEG  ANALYSIS 


By 

Will  Gersch 


1.  INTRODUCTION. 

This  paper  presents  the  theory  and  a  prototypic  example  of  an  explora¬ 
tory  population  screening-stationary  time  series  classification  problem. 

In  the  population  screening  problem  a  new  individual  is  classified  by 
comparing  measurements  obtained  from  him  with  measurements  obtained  from 
other  individuals  in  the  alternative  categorical  states.  Human  electro¬ 
encephalogram  (EEG)  time  series  were  obtained  during  surgery  simultaneously 
with  an  anesthesiologist's  appraisal  of  the  level  of  anesthesia  from  a 
moderate  but  not  large  number  of  individuals.  These  EEG  time  series  are 
considered  to  be  a  set  of  labeled  sample  time  series.  The  categorical 
time  series  classes  are  characterized  by  broad  intersubject  time  series 
variations.  An  implicit  conjecture  in  this  data  gathering  experiment  is 
that  there  is  sufficient  information  in  the  EEG  time  series  to  reliably 
classify  the  level  of  anesthesia  of  humans  in  surgery.  Our  objectives 
are  to  assess  the  separability  of  the  time  series  populations,  i.e. 
to  obtain  a  statistically  reliable  estimate  of  the  minimum  achievable 
probability  of  misclassif ication  of  new  time  series  and  to  implement  a 
time  series  classification  rule  that  can  achieve  those  statistical 


properties. 
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A  nearest  neighbor  time  series  classification  rule  achieves  those 
objectives.  With  that  rule  a  measure  of  dissimilarity  is  computed 
between  a  new  to-be-classified  time  series  and  each  of  a  set  of  categori¬ 
cally  labeled  time  series.  The  new  time  series  is  classified  with  the 
label  of  its  least  dissimilar  neighbor.  In  our  approach  the  dissimilarity 
measure  between  time  series  is  an  estimate  of  the  Kullback  Leibler  number 
between  the  time  series  as  if  the  time  series  were  normally  distributed. 
This  dissimilarity  measure  is  shown  to  have  sufficient  metric  properties 
for  the  formal  Cover  and  Hart  1967  asymptotic  nearest  neighbor  and  Rogers 
1977  finite  sample  nearest  neighbor  classification  rule  properties  to  hold. 
Those  properties  allow  the  conjecture,  that  there  is  sufficient  information 
in  the  EEG  time  series  to  reliably  classify  the  level  of  anesthesia  of 
humans  in  surgery,  to  be  tested  with  only  a  moderate  number  of  labeled 
sample  EEG  time  series. 

The  nearest  neighbor  Kullback  Leibler  type  dissimilarity  measure 
classification  rule  (NN-KL)  method  is  applied  to  the  classification  of 
the  level  of  anesthesia  of  humans  in  surgery  by  the  analysis  of  multi¬ 
channel  EEGs.  Application  of  the  method  exploits  time  domain  formulas 
for  the  Kullback  Leibler  number  between  multivariate  stationary  Gaussian 
time  series. 

Section  2  describes  the  nearest  neighbor  time  series  classification 
rule  with  Kullback  Leibler  type  dissimilarity  measure.  An  implementation 
and  interpretation  of  the  nearest  neighbor  Kullback  Leibler  classification 
rule  for  the  classification  of  stationary  time  series  is  in  Section  3. 

Also  in  that  section,  a  careful  distinction  is  made  between  our  own  use 
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of  nearest  neighbor  Kullback  Leibler  type  dissimilarity  classification 
rules  and  similarly  designated  feature  analysis-discriminant  analysis 
classification  procedures  that  are  common  in  speech  processing.  The 
anesthesia  level  classification  by  EEG  time  series  population  screening 
problem  example  is  in  Section  4.  An  appendix  shows  both  the  metric 
properties  of  the  Kullback  Leibler  type  dissimilarity  measure  and  time 
and  frequency  domain  Kullback  Leibler  number  formulas  for  multivariate 
stationary  Gaussian  time  series.  Relatively  nontechnical  discussions 
of  the  problem  discussed  in  this  paper  appear  in  Gersch  et  al  1979 
and  Gersch  et  al  1980. 
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2.  NEAREST  NEIGHBOR  RULE  CLASSIFICATION  WITH  A  KULLBACK  LIEBLER  TYPE 


DISSIMILARITY  MEASURE. 

Let  the  labeled  sample  time  series  be 


(1)1 

Lao] 

X 

X 

e(1)j 

>•••» 

e«> 

S. 

(2.1) 


x(m)  „  (x(m)(l),...,x(m)(T)),  x(m)(t)  -  (x^t),...^^))' 

6(m)  £  (l, . . . ,M)  * 

In  equation  (2.1)  x^  denotes  a  d  variable  -  T  duration  time  series, 
the  '  denotes  the  matrix  transpose  and  8^  denotes  the  label  or 
category  of  the  m-th  time  series.  There  are  M  alternative  categories. 
Designate  a  new  to-be-classified  time  series 


(0), 


(xv~'(l),...,xvw'(T))  . 


(2.2) 


The  nearest  neighbor  classification  rule  is:  Let  d(x 


(o)  (m) 


)  be  a 


measure  of  dissimilarity  between  the  new  time  series  x 
labeled  time  series  x^m\  for  m  =  1,...,N. 


(o) 


and  the 


If:  d(x^°\x^m  ^ )  £  d(x^  ,x^  )  m»l,...,N 

Then:  0(o)  -  9(m,) 


(2.3) 


That  is,  the  new  series  x^0^  is  given  the  label  of  its  nearest 
dissimilarity  measure  time  series. 
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The  dissimilarity  measure  between  time  series  that  we  employ  for 
classification  is  an  estimate  of  the  Kullback  Leibler  number  or  I-diver- 
gence  between  time  series,  computed  as  if  the  time  series  were  Gaussian 
distributed.  Let  X  and  X  be  two  d-vector  random  variabes  with 


o  m 

probability  density  functions  f  and  f^  respectively.  Then,  the 

I-divergence  between  f  and  f  is,  Kullback,  1968 

o  n 


f  (x) 

I(fo’fm>  “  /fo(x)  log  ~ 00 

m 


dx  . 


(2.4) 


In  particular,  let  X  ^  1K\i  ,E  )  and  X  »E  )•  That  is  let 

o  o  o  m  mm 

XQ  and  X^  each  be  normally  distributed  with  d-component  zero-mean 
vectors  and  d*d  covariance  matrices  respectively.  In  that 

case,  from  Kullback  1968 


2  I(f  ,f  )  -  log 
o  m 


\Z  [ 

TTT+  trE 


E  -  d 
m  o 


(2.5) 


In  equation  (2.5)  and  subsequently,  the  notation  |a|,  tr(A),  A 
denotes  respectively  the  determinant,  trace,  inverse  of  the  matrix 
A. 

Consider  the  d  variate-T  duration  labeled  sample  time  series  x 

(o) 


(m) 


m  »  1,...,N  and  the  new  time  series  x 


Let  Ej ,  j  »  o  or  m  be 


the  sample  or  estimated  covariance  matrices  respectively  of  x 


(j) 


with 


E,  ■  [y  1;  and  y  ,  the  r-c  row-column  element  of  E. .  Then, 
j  r,c  r,c’  j 

let 


(  (o)  (mk  _  _L_ 
,x  ;  2Td 


|l  I  i 

In  -t~-  +  tr  E  E  -  dT 

l=J 


m' 


(2.6) 
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denote  a  measure  of  the  dissimilarity  computed  between  the  sample  time 
series  x^0^  and  x^m\  That  is,  the  dissimilarity  measure  d(x^°\x^m^) 
in  equation  (2.6)  is  computed  from  the  sample  time  series  to  mimic  equation 
(2.5),  as  if  the  time  series  were  Gaussian  distributed. 

Comments: (1)  The  I-divergence  or  Kullback  Leibler  information  number 

(also  the  information  for  discrimination,  information  gain  or  entropy  of 

f  relative  to  f  )  has  a  basic  role  in  the  information  theoretic  approach 
o  m 

to  statistics,  and  in  statistical  physics  as  maximization  of  entropy 
Kullback,  1968,  Good,  1963,  Jaynes,  1957.  The  I-divergence  does  not 
satisfy  the  triangle  inequality  and  is  not  a  metric.  Certain  analogies 
do  exist  between  the  properties  of  probability  density  functions  and 
Euclidean  geometry,  wherein  I-divergence  plays  the  role  of  squared 
Euclidean  distance,  Csiszar,  1975. 

(2)  In  Appendix  1  it  is  shown  that  the  dissimilarity  measure  in 
equation  (2.6)  has  sufficient  metric  properties  for  the  formal  nearest 
neighbor  statistical  classification  properties  to  hold.  Those  properties 
are  that  the  asymptotic  probability  of  misclassif ication  is  bounded  between 
the  Bayes  risk  and  twice  the  Bayes  risk.  Cover  and  Hart  1967,  and  the 
0(1/N)  finite  labeled  sample  cross-validation-leave  out  one-at-a-time 
classification  of  the  labeled  sample  data  set  to  estimate  the  probability 
of  misclassif ication,  Cover  1969  and  Rogers  1977. 

(3)  The  cross  validation  estimate  of  the  probablity  of  misclassi- 
f ication  permits  the  implicit  conjecture  in  the  exploratory  population 
screening  problem  investigation,  that  there  is  sufficient  evidence  in  the 
measurement  data  to  achieve  statistically  satisfactory  discrimination,  to 
be  tested  with  only  a  moderate  number  of  labeled  samples. 
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(4)  Another  classification  problem  of  interest  is  the  "normalized 
baseline"  time  series  classification  problem.  That  problem  situation  is 
dominated  by  intrasubject  categorical  time  series  variability.  An  appli¬ 
cation  of  nearest  neighbor  Kullback  Leibler  type  dissimilarity  measures 
to  the  classification  of  faults  in  relating  machinery  in  a  normalized 
baseline  classif iation  problem  context  is  in  Gersch  et  al  1980b. 
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3.  IMPLEMENTATION  AND  INTERPRETATION  OF  THE  NEAREST  NEIGHBOR  TIME 
SERIES  CLASSIFICATION  RULE. 

The  formula  for  the  dissimilarity  between  the  T-duration  d-variable 

sample  time  series  x^°^  and  in  equation  (2.6)  indicates  operations 

on  matrices  of  size  Td*Td.  Almost  invariably  direct  computations  on  such 

sized  matrices  is  forbidding.  Alternatively,  explicit  time  and  frequency 

domain  formula  for  the  specific  situation  of  the  Kullback  Leibler  number 

between  multivariate  stationary  ergodic  Gaussian  distributed  time  series 

are  of  interest.  Such  formulas  are  developed  in  Appendix  2.  Mimicing 

those  formulas  yields  practical  implementable  dissimilarity  measure 

computations  that  only  involve  operations  on  d*d  matrices.  A  tried 

and  recommended  procedure  for  computing  those  dissimilarity  measures 

involves  the  parametric  autoregressive  (AR)  modelling  of  the  and 

/  \ 

x^  time  series. 

For  example,  consider  the  d-variate  time  series  x^\  for  j  =  0 
or  m  and  let 


-  1 
x 


f  l  *(3><x) 

t=l 


T-k 


f(j)(k)=^  l  <x(j)(t+k)  -x)(x(j)(t) -x)\  k=  0,1,...  (3.1) 


t=l 


denote  the  sample  mean  and  sample  covariance  of  the  j-th  time  series. 


Then,  the  autoregressive  model  of  order  fitted  to  x 

satisfies. 


(j) 


i=0 


(3.2) 


E[e(j)(t)]  =  0,  E[e(j)(t+k)e(j)(t)’]  =  V 
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In  equation  (3.2)  x^(t)  and  e^\t)  are  d-vectors  and  A^^(i) 

are  d  *  d  matrices.  The  AR  model  in  equation  (3.1)  may  be  fitted 
to  the  labeled  sample  time  series  x^m\  m  =  1,...,N  and  the  new  time 
series  x^°^  by  employing  the  Whittle-Robinson  recursive  model  computa¬ 
tion  —  Akaike  AIC  criterion  model  order  selection  procedure,  Whittle 
1963,  Akaike  1974.  The  fitting  of  multivariate  AR  models  to  data  and 
illustrative  examples  are  shown  in  Akaike  1976  and  Gersch  and  Yonemoto 
1977. 

Then,  a  computationally  convenient  dissimilarity  measure  between 
the  time  series  x  and  x  is 


P  P 

m 


2d(x(o),x(m))  =  An  -A-  +  tr(  £  £  A(m) (i)T(o) (j-i)A(m) ' (j )V  X)  -  d 

I V  |  i=0  j=0 

m 

(3.3) 


Equation  (3.3)  only  involves  operations  on  d*d  matrices.  It  mimics  the 
second  time  domain  formula  in  Appendix  2  for  the  computation  of  Kullback 
Leibler  numbers  between  the  probability  density  functions  of  Gaussian 
distributed  zero-mean  stationary  time  series.  The  finite  duration  multi¬ 
variate  time  series  x^  and  x^  are  modeled  by  finite  order  auto¬ 
regressive  models.  In  equation  (3.3),  the  hatted  quantities  are 

estimates  of  the  corresponding  theoretical  quantities,  p^  is  the  order 

/  \  A  /  \ 

of  the  AR  modeled  time  series  x^  and  i  '(’)  is  the  sample  covariance 
matrix  function  of  the  new  time  series  x^°^. 

Figure  1  shows  a  schematic  implementation  of  the  computation  of  the 
dissimilarity  measure  between  the  new  time  series  x^°^  and  the  labeled 
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sample  time  series  for  m  =  1  .  AR  models  of  the  and  the 

labeled  >  ^uipie  x^m^'s  are  assumed.  Application  of  the  new  time  series 

x^°\t)  to  the  m-th  AR  model  yields  the  residual  time  series 
e(°,m)(t),  t  =  The  dissimilarity  measure,  d(x^°^ ,x^m^ ) ,  can 

also  be  expressed  in  terms  of  a  formula  involving  the  residual  time  series 
e(°»o)(t)  e^0,m\t),  Gersch  1977.  The  term  residual  is  the  quantity 

remaining  or  not  explained  after  a  particular  model  is  fitted  to  the  data. 
If  one  of  the  labeled  sample  AR  time  series  models  is  precisely  the  AR 
model  that  corresponds  to  the  generation  of  the  x^°\t)  data,  the  corres¬ 
ponding  residual  sequence  will  be  a  white  noise  sequence.  In  that  sense, 
the  nearest  neighbor  rule  selects  the  "closest  to  whiteness"  residual 
sequence. 

More  concisely,  the  AR  models  of  the  labeled  time  series  sample  can 

be  interpreted  as  templates  of  those  time  series.  In  effect,  in  the 

nearest  neighbor  classification  procedure,  the  new  time  series  is  compared 
against  the  templates  of  the  labeled  sample  time  series.  The  most  similar 
template  is  the  one  for  which  the  dissimilarity  measure  is  smallest. 
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(N) 


) 
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Figure  1.  A  schematic  Implementation  of  a  time  series  data 
nearest  neighbor  rule  classification  procedure. 


4. 


AN  ANESTHESIA  LEVEL  CLASSIFICATION  BY  EEG  ANALYSIS  POPULATION 
SCREENING  PROBLEM. 


An  exploratory  EEG  time  series  data-population  screening  classifica¬ 
tion  problem  is  treated  by  the  nearest  neighbor  rule  approach.  The 
category  or  state  of  an  individual  is  classified  by  comparison  of  his 
or  her  EEG  with  EEGs  taken  from  other  individuals.  The  automatic  classi¬ 
fication  of  anesthesia  levels  Ll  and  L3,  respectively  the  anesthesia 
levels  insufficient  for  and  sufficient  for  deep  surgery  by  machine  computa¬ 
tions  on  the  EEG  alone  is  considered.  Extension  of  the  nearest  neighbor 
rule  approach  to  distinguish  between  more  than  two  categories  or  anesthesia 
levels  does  not  involve  any  new  concepts. 

The  anesthesia  level  EEG  data  originated  in  an  experiment  at  Vancouver 
General  Hospital.  280  epochs  of  visually  screened,  relatively  artifact 
free,  stationary  halothane-nitrous  oxide  anesthesia  level  labeled  EEGs 
were  collected  from  twenty  different  individuals  in  surgery.  The  non-EEG 
criteria  determined  anesthesia  levels  were  classified  by  a  single  anesthe¬ 
siologist  to  eliminate  the  problem  of  inter-EEG-rater  variability.  Details 
of  the  experimental  surgical  anesthesia  situation  and  a  review  of  the 
status  of  automatic  classification  of  anesthesia  levels  using  EEG  data 
appear  elsewhere,  McEwen,  1975a, b  and  Gersch  et  al  1980a.  The  data  con¬ 
sisted  of  64  second  recordings  of  four  channel  EEG  epoch  data,  (F4-C4, 

F3-C3,  C4-02,  and  C3-01  in  the  10-20  EEG  system),  analogue-FM  recorded 
through  a  0.54  to  30  Hz.  bandpass  filter  and  subsequently  digitially  tran¬ 
scribed  at  the  rate  of  128  samples/ second.  An  examination  of  the  avail¬ 
able  data  suggested  that  we  confine  our  attention  to  a  two  category  clas¬ 
sification  problem,  to  classify  the  anesthesia  levels  Ll  and  L3  respectively. 
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the  anesthesia  levels  that  are  insufficient  and  just  sufficient  for 
deep  surgery.  The  data  selected  for  analysis  was  the  73  EEG  epochs 
comprised  of  all  the  35-L1  EEG  epochs  available  and  38-L3  EEG  data 
epochs  (in  sets  of  2-3  epochs  per  individual)  from  a  total  of  18  different 
individuals.  The  analysis  was  performed  on  the  first  twenty  second  inter¬ 
vals  of  each  EEG  data  epoch  at  a  reduced  data  rate  of  128/3  samples  per 
second  on  d  =  4  EEG  data  channel  and  d  =  2  EEG  data  channel  (C4-02  and 
C3-01)  data.  This  constitutes  the  labeled  sample  data  base. 

The  implicit  conjecture  in  the  EEG  population  screening  problem  is 
that  there  is  sufficient  information  in  the  EEG  alone  to  achieve  clinic¬ 
ally  acceptable  levels  of  discrimination  between  categorical  EEG  states. 
The  credibility  of  this  conjecture  is  strained  by  evidence  of  the  broad 
intersubject  categorical  EEG  variability.  Figure  2,  2-channel  twenty 
second  anesthesia  level  LI  and  L3  EEG  epochs  from  five  different  subjects 
suggests  that  the  EEG  of  an  individual  does  differ  in  the  LI  and  L3  anes¬ 
thesia  level  states  and  also  illustrates  broad  intersubject  EEG  variabi¬ 
lity.  The  LI  EEGs  appear  to  be  relatively  homogeneous  "fast"  EEGs  whereas 
the  L3  EEGs  include  fast,  slow,  regular  and  irregular  EEGs.  No  obvious 
visual  properties  of  the  EEGs  distinguish  the  LI  and  L3  EEGs  from  each 
other. 

A  useful  statement  of  the  conjecture  in  the  EEG  population  screening 
problem  is:  Given  labeled  EEG  samples  from  two  categorical  populations, 
estimate  the  theoretically  best  achievable  statistical  classification  per¬ 
formance.  The  use  of  the  KL  number  type  metric  in  NN  rule  classification, 
in  a  delete-one  subject's  EEG-at-a-time  KL-NN  and  KL-kNN  classification  of 
the  labeled  EEG  sample  base,  yields  that  estimate.  (See  Duda  and  Hart  1973 
for  kNN  rules). 
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V6H  -  HA  (F  4,  LI,  S4Q)  0-20  SECS. 


VGH  -  HA  (F  22,  U,  S42)  0-20  SECS. 

g 

VGH  -  HA  (F  97,  Q,  S52)  0-20  SECS. 

VGH  -  HA  (144,  LI,  S71)  0-20  SECS. 

VGH  -  HA  (F184,  U,  S73)  0-20  SECS, 
4 


V6H  -  HA  (F  2,  13,  S40)  0-20  SECS. 


VGH  -  HA  (F  26,  13,  S42)  0-20  SECS. 

M/^vWW'JVA at\/^W^Vv^^''V^ 

VGH  -  HA  (F101,  3,  S52)  0-20  SECS. 

VGH  -  HA  (F145,  13,  S71)  0-20  SECS. 

VGH  -  HA  (F  4,  U,  S40)  0-20  SECS. 

4#nfw*4^ 


Figure  2.  Two  channel  EEG  time  series  from  five  different  individuals 
in  each  of  the  anesthesia  level  states  LI  and  L3. 
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To  achieve  a  baseline  appraisal  of  the  achievable  discriminability 
between  the  LI  and  L3  anesthesia  level  EEC  sample  populations,  the  EEC 
epochs  of  a  single  Individual  at  a  time  were  deleted  from  the  18  individual 
-  73  epoch  labeled  sample  EEG  data.  Each  of  the  deleted-individual 's  EEC 
epochs  was  classified  against  the  remaining  17  individual  labeled  EEG 
sample  population  using  KL-NN  and  KL-kNN  rules.  The  results  obtained  are 
shown  in  Table  1.  The  entries  in  the  table  indicate  the  number  of  classi¬ 
fication  errors  and  the  percentage  of  correct  classification  for  the  best 
d  “  2  EEG  channel  and  d  *  4  EEG  channel  KL-NN  classification  performance. 
The  best  classification  results  for  the  d  ■  2  and  d  »  4  EEG  data  channels 
was  85%  and  89%  overall  correct  classification  respectively. 


TABLE  1:  DELETE  ONE-SUB JECT-AT-A-T1ME,  KL-NN  RULES  RESULTS 
KL-3NNN;  d  -  2  KL-NN;  d  -  4 


The  objectives  of  this  exploratory  population  screening  anesthesia 
level  classif icacion  by  EEG  analysis  study  have  been  very  clearly  met. 
with  only  a  moderate  sized  label  sample  data  base,  the  results  obtained 
quite  reliably  suggest  that  the  population  screening  anesthesia  level 
classification  by  EEG  analysis  scenario  has  substantial  possibilities  for 
clinical  applications. 
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Consents:  1)  Additional  considerations  for  the  implementat ion  of 
nearest  neighbor  rules  in  automatic  EEC  classification  such  as  the  conse¬ 
quences  of  alternative  EEC  normalizations  on  classification  performance 
and  nearest  neighbor  data  thinnings  analysis  considerations  to  economize 
on  computational  and  storage  burdens  are  examined  in  Cersch  et  al  1980a. 
Briefly,  any  comparison  of  EEC  time  series  is  subject  to  arbitrary  con¬ 
ventions,  criteria  and  normalizations.  The  alternative  normalizations  of 
the  EEC  chat  are  possible  in  nearest  neighbor  rule  classification  are 
explicit  in  equation  (3.3),  the  dissimilarity  measure  formula  between 
stationary  time  series  EEGs.  The  alternative  normall.at ions  influence 
the  relative  dominance  of  the  first  and  second  terms  in  that  equation. 

The  related  subject  of  distortion  measures  for  speech  processing  is 
treated  by  Gray  et  al  1980. 

2)  Time  series  classification  by  nearest  neighbor  rules  with 
Kullback  Leibler  type  dissimilarity  measures  for  classification  are  very 
well  known  in  speech  processing,  Itakura  and  Saito,  1970,  and  Gray  et 
al  1980.  In  those  applications  KL  number  dissimilar ity  measures 
most  commonly  Involve  the  modelling  of  each  of  the  labeled  sample  and 
new  (scalar)  time  series  by  fixed  order  autoregressive  models. 

Those  AR  model  parameters  or  features  are  transformed  into  the  Kullback 
Leibler  number  measures.  Because  the  order  of  the  AR  models  fitted  to 
each  time  series  is  fixed,  that  classification  procedure  is  potentially 
of  the  feature  analysis-discriminant  analysis  variety.  The  poignant 
remark  by  Cover  1973,  that  the  problem  for  which  that  solution  is  optimum 
is  not  known  is  applicable  here.  Thus  the  usual  speech  processing  adaption 
of  NN-KL  type  metric  classification  has  no  necessary  statistical 


nor  nearly  optimal  statistical  classlf latlon  properties.  An  example 
of  the  mlsclassif lcation  of  time  series  that  results  from  arbitrarily 
fixing  the  order  of  AR  model  fitted  to  the  time  series  is  la 
Brotherton  and  Gersch,  1980. 
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APPENDICES 


1.  THE  METRIC  PROPERTIES  OF  d(x(o) ,x(m) ) . 


Here  we  show  that  the  Kullback  Leibler  type  dissimilarity  measure 
between  time  series  has  sufficient  metric  properties  that  the  formal 
nearest  neighbor  rule  statistical  classification  properties  apply  to 
nearest  neighbor  classification  rules  that  employ  that  measure.  Following 
the  development  of  Cover  and  Hart  1973,  it  is  only  necessary  to  show  that 
i)  d(x(o),x(o))  -  0  ,  ii)  d(x(o),x(m))  >  0  for  any  x(m)  +  x(o),  and 
iii)  the  minimum  value  of  the  dissimilarity  measure  d(x^°\x^m^)  -►  0, 
as  N  the  number  of  labeled  samples  increases  indefinitely.  Property 
i)  is  immediate  from  equation  (2.6).  Property  ii)  is  proved  below 
separately.  To  prove  property  iii);  Let  the  sample  Td x  Td  covariance 

A 

matrix  I  be  distributed  in  accordance  with  distribution  F.  Let 
o 

AAA 

E  E,  _,  E„  _,...  be  I ID  random  variables  from  that  distribution. 
o,T  1,T  Z,T 

Td  x  Td 

Then  the  space  R  on  which  the  sample  covariances  are  defined  is 

a  separable  metric  space  and  the  minimum  Euclidean  distance  between  the 

/S  A 

sample  covariances  goes  to  zero.  That  is,  [[Eq  -  E^,  ||  -*•  0.  Then,  since 


d(x^°\x^m^)  “  d(E  ,E  ,)  is  a  continuous  function  of  E  ,  and  E  ,  -*■  E 


o'  m 


o.T 


in  RTd x  Td  <j(I  y  )  +  o,  Royden,  1968.  Property  ii);  Consider  the 
o,T  m  ,T 

situation  with  E  -  E  and  E  -  E  +  A.  The  matrix  A  denotes  a  small 
o  o  mo 

perturbation  matrix.  For  convenience  subscript  T  and  hat  notation  will 
be  dropped  in  what  follows.  Then 


2Td(x<o),x(m))  -  2Td(E  ,E  +A)  -  [In 

o  o 


E  +A 


+  tr (E  (E  +A)  )-Td ]  . 
o  o 
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We  would  like  to  prove  that  d(£Q,£o+A)  0.  £q  Is  symmetric,  positive 

definite  and  fixed,  £^+A  symmetric  and  positive  definite  and  A  is 

-1/2  -1/2 

"small".  Let  A  -  £  A  £  .  Then  A  is  symmetric,  and  I +  A  is 

o  o 

positive  definite.  Then,  the  last  equation  can  be  written 


f  (A)  -  Jlnll+Al  +  tr (I+A)  -  Td  . 


The  problem  is  now  reduced  to  demonstrating  that  f(A)  is  convex  in 
the  neighborhood  of  A+0.  Let  A  *  A(s)  be  linear  in  s.  Then  by  the 
rules,  d£n|x[  -  tr(X-1)(dX)  and  dX-1  -  -(X-1) (dX) (X-1) ,  Anderson, 
1958, 


—  -  tr  (I+A)'1  -  tr  (I+A)_1  (~)(I+A)-1 


2 

where  (4^)  is  a  symmetric  matrix.  Since  A(s)  is  linear,  -A^f-  *  0, 
as 

as 


d-2f-^-  -  -  tr (I+A)-1  (^)(I+A)-1(~) 
ds 


+  2  tr(I+A)-1  (^)(I+A)-1  (^J)  (I+A)-1 


t r  ( I+A) _1  (|f)  [  2  (I+A) _1  -  I  ]  (^)(I+A)-1 


-  tr  Bt2(I+A)"  -  I]B’  . 
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-1  dA 

In  the  equation  above  we  have  let  B  *  (I+A)  (— )  .  The  right-hand 

ds 

side  of  that  equation  will  be  non-negative  provided  the  term  in  brackets 
in  the  last  row  is  positive  semidef inite.  But  that  is  equivalent  to 
21-  (I+A),  and  the  (I-A)  being  positive  semidef inite.  That  implies 
A  1  in  the  sense  of  positive  definiteness  and  also  -I  A  since 

I+A  _>  0.  Then,  clearly  when  A  =  0,  ^  f(A(s))  >  q  an(j  general 

ds2 

f(A)  is  convex  provided  -I  <  A  <  I. 


A2.  TIME  AND  FREQUENCY  DOMAIN  KULLBACK  LEIBLER  FORMULAS  BETWEEN 
STATIONARY  GAUSSIAN  TIME  SERIES. 

TIME  SERIES  REPRESENTATIONS.  Let  (x(o)(t)}  and  {x(m)(t)}  denote 
d-variabe  zero-mean  stationary  ergodic  Gaussian  time  series  with  corres¬ 
ponding  probability  density  function  f^°\  f^  and  d*d  matrix 
covariance  functions  r^°^(k),  F^(k)  and  power  spectral  density 

matrices  S  (f)  and  S  (f)  respectively.  Identify  the  time  series 
o  m 

x(i)(t)  parametrically  in  terms  of  the  Wold  (moving  average)  and  auto¬ 
regressive  representations.  Whittle,  1963. 


x(i)(t)  =  h(i)(t)  *  e(i)(t)  i  =  0,1,2 . M 

A(i)(t)  *  x(i)(t)  =  E(i)(t); 


E(£(i)(t))  -  0;  E(e(i)(t+k)£(i)(t)’)  =  V±  6 


In  equation  (Al) ,  the  symbol  *  denotes  the  convolution  operation,  E 
is  the  expectation  operator  and  (h^(t)}  and  (A^(t)}  are  respectively 
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I 


Che  dx  d  Impulse  matrix  response  and  AR  matrix  coefficients, 
the  action  of  the  AR  operator  defined  by  x^(t)  on  x^°\t) 

A(m)(t)  *  x(0)(t)  «  e(°*m)(t)  . 

In  equation  (A2) ,  e^°,m^(t)  has  an  interpretation  as  a  zero-mean 

"residual"  time  series  in  the  conventional  sense  of  a  regression  analysis. 
Its  zero-log  covariance  matrix  is 

E(e(°’m)(t)e(o>m)(t)')  =  V°  .  (A3) 

m 

Employing  the  notation  of  equation  (A2)  in  equation  (Al) 

A(m)(t)  *  h(o)(t)  *  eCo)(t)  =  e(°,m)(t) 

h(o’m)(t)  *  e(o)(t)  =  e(o’m)(t)  .  (A4) 

In  equation  (A4),  h^°,m^(t)  designates  the  impulse  response  of  the 
cascade  of  filters  A^m^(t)  and  h^°^(t).  By  elementary  linear  operations, 

h(0’m)(t)  =  h(o)(t)  +  l  A(m)(i)h(o)(t-i),  t  =  0,1,...  (A5) 

i=l 

KULLBACK  LEIBLER  NUMBER  FORMULAS.  Then,  time  and  frequency  domain 
formulas  for  the  Kullback  Leibler  numbers  between  those  Gaussian  time 
series  are: 


Denote 

by 

(A2) 
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2I(f(o),f(m))  =  An  -L4-  +  tr  (  l  h(o,m)(t)V  h(o,m) ' (t)V_1)  -  d 

'V  t=0  m 

I  V  I  00  00 

=  ln-n~r+tr(l  £  A(m)  (i)r(o)  (j-i)A(m)  '  (j  )v71)  -  d 

1  o1  i=0  j=0  J 

|  v  I  r  k 

=  An  -|~j-  +  tr(  j  ^  SQ(f)Sm(f)  1  df)  -d  .  (A6) 


An  intermediate  result,  derived  from  equation  (2.6)  and  the  time 
series  notation  above,  from  which  the  results  in  equation  (A6)  follow 
is  that 


2I(f(o),f(m))  = 


m  , 

KT  + 


tr[V°  V  1] 
m  m 


(A7) 


Then,  the  first  two  parametric  time  domain  formulas  for  the  Kullback 
Leibler  number  between  stationary  Gaussian  time  series  in  equation  (A6) 
may  be  derived  from  equation  (A7)  by  replacing  by  its  definition, 

equation  (A 3)  and  then  substituting  for  by  its  representations 

in  equations  (A4)  and  (A2)  and  taking  the  indicated  expectations.  The 
frequency  domain  formula,  the  third  line  in  equation  (A6)  is  obtained 
from  equation  (A7)  by  the  use  of  Parseual's  theorem  and  the  assumption 
of  ergodicity. 

Conments:  The  first  development  of  a  frequency  domain  formula  for  the 
Kullback  Leibler  number  between  Gaussian  distributed  time  series  was 
probably  due  to  Pinsker,  1964,  that  work  appears  to  have  remained  almost 
unknown  to  Western  researchers.  Subsequently  frequency  domain  formulas 
were  developed  by  Shumway  and  Unger,  1974,  Hawkes  and  Moore,  1976  and 
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B.D.O.  Anderson  et  al  1978. 


The  first  time  domain  formula  for  the 


Kullback  Leibler  number  between  scalar  time  series  is  probably  due  to 
Itakura  and  Saito  1968,  in  their  search  for  distance  measures  in  speech 
classification.  A  complete  and  up-to-date  treatment  of  that  approach 
is  in  Gray  et  al  1980.  Akaike,  1976  shows  different  development  of  the 
second  time  domain  formula  in  equation  (A6) .  That  development  has 
attracted  little  attention. 


Will  Gersch 

Dept,  of  Information  & 
Computer  Science 
University  of  Hawaii 
Honolulu,  Hawaii  96822 
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